Bayesian ML Exam — Formula Cheatsheet

Organized by exercise type number. Every formula you need, nothing you don't.


Type 1: Bayes' Rule & Posterior Computation

| Formula | Expression |
| --- | --- |
| Bayes' Rule | $p(\theta \mid x) = \dfrac{p(x \mid \theta) \cdot p(\theta)}{p(x)}$ |
| Proportional form | $p(\theta \mid x) \propto p(x \mid \theta) \cdot p(\theta)$ |
| Evidence | $p(x \mid m) = \int p(x \mid \theta, m) \cdot p(\theta \mid m) \, d\theta$ |
| Joint | $p(x, \theta) = p(x \mid \theta) \cdot p(\theta)$ |

Beta Function Trick (for integrals over $\theta \in [0,1]$)

$\int_0^1 \theta^p (1-\theta)^q \, d\theta = \dfrac{\Gamma(p+1)\,\Gamma(q+1)}{\Gamma(p+q+2)}$

How to use: multiply likelihood × prior → read off the powers $p$ and $q$ → plug them into the formula above → multiply by any constant that sits outside the integral.
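
Quick sanity check (a minimal Python sketch, assuming SciPy is installed; the values of $p$ and $q$ are arbitrary examples):

```python
from math import gamma
from scipy.integrate import quad

# Powers read off from likelihood x prior, e.g. theta^3 * (1 - theta)^2
p, q = 3, 2

# Closed form via the Gamma-function formula
closed_form = gamma(p + 1) * gamma(q + 1) / gamma(p + q + 2)

# Direct numerical integration for comparison
numeric, _ = quad(lambda t: t**p * (1 - t) ** q, 0.0, 1.0)

print(closed_form, numeric)  # both 1/60 = 0.01666...
```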


Type 2: Beta-Bernoulli Coin Toss

| Formula | Expression |
| --- | --- |
| Bernoulli likelihood | $p(D \mid \mu) = \mu^{N_1}(1-\mu)^{N_0}$ |
| Beta prior | $p(\mu) = \text{Beta}(\mu \mid \alpha, \beta) = \dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \mu^{\alpha-1}(1-\mu)^{\beta-1}$ |
| Posterior | $p(\mu \mid D) = \text{Beta}(\mu \mid \alpha + N_1,\; \beta + N_0)$ |
| Evidence | $p(D) = \dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \cdot \dfrac{\Gamma(\alpha+N_1)\Gamma(\beta+N_0)}{\Gamma(\alpha+\beta+N)}$ |
| Predictive | $p(x_{\text{next}}=1 \mid D) = \dfrac{\alpha + N_1}{\alpha + \beta + N}$ |
| Beta mean | $\mathbb{E}[\mu] = \dfrac{\alpha}{\alpha + \beta}$ |
| Gamma (integers) | $\Gamma(n) = (n-1)!$ |

Key: $N_1$ = count of ones, $N_0$ = count of zeros, $N = N_1 + N_0$. No binomial coefficient in the likelihood: the data are a specific ordered sequence of tosses, not just a total count.
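
The whole pipeline as executable arithmetic (a sketch with made-up prior parameters and counts):

```python
from math import gamma

# Beta-Bernoulli update from counts (alpha, beta are prior pseudo-counts)
alpha, beta = 2.0, 2.0   # prior Beta(2, 2)
N1, N0 = 7, 3            # observed ones and zeros
N = N1 + N0

# Posterior is Beta(alpha + N1, beta + N0) by conjugacy
post_a, post_b = alpha + N1, beta + N0

# Evidence p(D) from the ratio of normalizers
evidence = (gamma(alpha + beta) / (gamma(alpha) * gamma(beta))
            * gamma(alpha + N1) * gamma(beta + N0) / gamma(alpha + beta + N))

# Predictive probability that the next toss is 1 (posterior mean)
predictive = (alpha + N1) / (alpha + beta + N)

print(post_a, post_b, evidence, predictive)
```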


Type 3: Gaussian Posterior & Evidence

Setup: Likelihood $\mathcal{N}(x \mid \mu, \sigma^2)$, Prior $\mathcal{N}(\mu \mid \mu_0, \sigma_0^2)$

| Formula | Expression |
| --- | --- |
| Posterior variance | $\dfrac{1}{\sigma^2_{\text{post}}} = \dfrac{1}{\sigma^2_0} + \dfrac{1}{\sigma^2}$ |
| Posterior mean | $\mu_{\text{post}} = \sigma^2_{\text{post}} \left( \dfrac{\mu_0}{\sigma^2_0} + \dfrac{x}{\sigma^2} \right)$ |
| Evidence | $p(x) = \mathcal{N}(x \mid \mu_0,\; \sigma^2_0 + \sigma^2)$ |

Shortcut: Posterior precision = sum of precisions. Posterior mean = precision-weighted average. Evidence variance = prior variance + likelihood variance.

When both variances = 1: $\sigma^2_{\text{post}} = 0.5$, $\mu_{\text{post}} = (\mu_0 + x)/2$.
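
The shortcut in code (a sketch; the prior, noise variance, and observation are invented):

```python
# Conjugate Gaussian update for a single observation x
mu0, var0 = 0.0, 1.0   # prior N(mu0, var0)
var = 1.0              # likelihood noise variance sigma^2
x = 2.0                # observation

# Posterior precision = sum of precisions
var_post = 1.0 / (1.0 / var0 + 1.0 / var)

# Posterior mean = precision-weighted average of prior mean and observation
mu_post = var_post * (mu0 / var0 + x / var)

# Evidence variance = prior variance + likelihood variance
evidence_var = var0 + var

print(mu_post, var_post, evidence_var)  # 1.0, 0.5, 2.0 for these numbers
```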


Type 4: Model Evidence & Bayesian Model Averaging

| Formula | Expression |
| --- | --- |
| Evidence | $p(x \mid m_k) = \int p(x \mid \theta, m_k) \cdot p(\theta \mid m_k) \, d\theta$ |
| Model averaging | $p(x) = \sum_k p(x \mid m_k) \cdot p(m_k)$ |
| Gaussian evidence | $p(x \mid m) = \mathcal{N}(x \mid \mu_0,\; \sigma^2_0 + \sigma^2)$ |
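
Illustration of the averaging sum (a sketch with two hypothetical Gaussian-evidence models and equal model priors; all numbers invented):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, var):
    """Density of N(x | mean, var)."""
    return exp(-0.5 * (x - mean) ** 2 / var) / sqrt(2 * pi * var)

# Two candidate models, each with its own Gaussian evidence and prior weight
x = 1.5
evidences = [normal_pdf(x, 0.0, 2.0), normal_pdf(x, 3.0, 2.0)]  # p(x | m_k)
priors = [0.5, 0.5]                                             # p(m_k)

# Bayesian model averaging: weight each evidence by its model prior
p_x = sum(e * p for e, p in zip(evidences, priors))
print(p_x)
```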

Type 5: Model Comparison & Bayes Factor

| Formula | Expression |
| --- | --- |
| Bayes Factor | $B_{12} = \dfrac{p(D \mid m_1)}{p(D \mid m_2)}$ |
| Posterior ratio | $\dfrac{p(m_1 \mid D)}{p(m_2 \mid D)} = \dfrac{p(D \mid m_1)}{p(D \mid m_2)} \cdot \dfrac{p(m_1)}{p(m_2)} = B_{12} \cdot \dfrac{p(m_1)}{p(m_2)}$ |
| BF from posteriors | $B_{12} = \dfrac{p(D \mid m_1)}{p(D \mid m_2)} = \dfrac{p(m_1 \mid D)}{p(m_2 \mid D)} \cdot \dfrac{p(m_2)}{p(m_1)}$ |
| Posterior model prob | $p(m_k \mid D) = \dfrac{p(D \mid m_k) \cdot p(m_k)}{p(D)}$ |

Interpretation: Posterior ratio > 1 → model 1 wins. Evidence = ∫ likelihood × prior dθ.
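
The bookkeeping in code (the evidence values here are invented, purely to show the arithmetic):

```python
# Bayes factor and posterior model probabilities from two evidences
ev1, ev2 = 0.12, 0.03        # p(D | m1), p(D | m2)
prior1, prior2 = 0.5, 0.5    # p(m1), p(m2)

B12 = ev1 / ev2                       # Bayes factor
post_odds = B12 * (prior1 / prior2)   # posterior ratio

# Posterior probability of model 1 via Bayes' rule
p_m1 = ev1 * prior1 / (ev1 * prior1 + ev2 * prior2)

print(B12, post_odds, p_m1)  # 4.0, 4.0, 0.8
```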


Type 6: Bayesian Classifier

| Formula | Expression |
| --- | --- |
| Posterior | $p(C_k \mid x) = \dfrac{p(x \mid C_k) \cdot p(C_k)}{p(x)}$ |
| Evidence | $p(x) = p(x \mid C_1) \cdot p(C_1) + p(x \mid C_2) \cdot p(C_2)$ |
| Decision boundary | $\dfrac{p(x \mid C_1) \cdot p(C_1)}{p(x \mid C_2) \cdot p(C_2)} = 1$ |
| Error probability | $P(\text{error}) = \int_{\text{decide }C_2} p(x \mid C_1)p(C_1)dx + \int_{\text{decide }C_1} p(x \mid C_2)p(C_2)dx$ |

Boundary shortcuts:

  • Different covariances → quadratic (parabola)
  • Shared covariance → linear (straight line)

Learned class prior (Beta): $p(C_1 \mid x_\bullet, D) \propto \int p(x_\bullet \mid C_1)\, p(C_1 \mid \theta)\, p(\theta \mid D)\, d\theta$
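
Putting the classifier formulas together (a sketch with invented 1-D Gaussian class-conditionals and priors):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, var):
    """Density of N(x | mean, var)."""
    return exp(-0.5 * (x - mean) ** 2 / var) / sqrt(2 * pi * var)

# Two-class Bayesian classifier with 1-D Gaussian class-conditionals
x = 1.0
lik1, lik2 = normal_pdf(x, 0.0, 1.0), normal_pdf(x, 2.0, 1.0)  # p(x | C_k)
prior1, prior2 = 0.6, 0.4                                      # p(C_k)

# Evidence via total probability, then the class posteriors
p_x = lik1 * prior1 + lik2 * prior2
post1 = lik1 * prior1 / p_x
post2 = lik2 * prior2 / p_x

# Decide C_1 whenever the posterior ratio exceeds 1
decision = "C1" if post1 > post2 else "C2"
print(post1, post2, decision)
```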


Type 7: Ball/Box Conditional Probability

| Formula | Expression |
| --- | --- |
| Product Rule | $P(A, B) = P(A \mid B) \cdot P(B)$ |
| Total Probability | $P(A) = \sum_i P(A \mid B_i) \cdot P(B_i)$ |
| Bayes' Rule | $P(B_i \mid A) = \dfrac{P(A \mid B_i) \cdot P(B_i)}{P(A)}$ |

Without replacement: Update counts after each draw (decrease both the drawn type and total by 1).
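
Example of the count-update rule (a sketch for a hypothetical box with 3 red and 2 blue balls):

```python
from fractions import Fraction

# Drawing two balls without replacement from a box with 3 red, 2 blue
red, blue = 3, 2
total = red + blue

# P(red first): counts before any draw
p_red1 = Fraction(red, total)

# P(red second | red first): one red and one ball fewer after the draw
p_red2_given_red1 = Fraction(red - 1, total - 1)

# Product rule: P(red, red) = P(red2 | red1) * P(red1)
p_both_red = p_red1 * p_red2_given_red1
print(p_both_red)  # 3/5 * 2/4 = 3/10
```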


Type 8: Gaussian Mixture Model (GMM) Form

| Formula | Expression |
| --- | --- |
| Joint (one-hot $z_n$) | $p(x_n, z_n) = \prod_{k=1}^K \left(\pi_k \cdot \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right)^{z_{nk}}$ |
| Marginal | $p(x_n) = \sum_{k=1}^K \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)$ |
| Mixing constraint | $\sum_{k=1}^K \pi_k = 1$ |

Spotting trick: Both $\pi_k$ AND $\mathcal{N}$ must be inside the parentheses raised to $z_{nk}$ (not $z_n$). Joint uses product ∏, marginal uses sum Σ.
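
A tiny numeric illustration of the marginal, plus the posterior over $z_{nk}$ obtained by applying Bayes' rule to the joint (all parameters invented):

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, var):
    """Density of N(x | mean, var)."""
    return exp(-0.5 * (x - mean) ** 2 / var) / sqrt(2 * pi * var)

# 1-D mixture with K = 2 components (weights must sum to 1)
pis = [0.3, 0.7]
means, vars_ = [0.0, 4.0], [1.0, 2.0]

x = 1.0
# Marginal: sum over components of pi_k * N(x | mu_k, var_k)
components = [p * normal_pdf(x, m, v) for p, m, v in zip(pis, means, vars_)]
p_x = sum(components)

# Posterior over the component: each term's share of the marginal
posteriors = [c / p_x for c in components]
print(p_x, posteriors)
```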


Type 9: Variational Free Energy (VFE)

| Formula | Expression |
| --- | --- |
| VFE functional | $F[q] = \int q(z) \log \dfrac{q(z)}{p(x,z)} \, dz$ |
| Upper bound | $F[q] \geq -\log p(x)$ for any $q(z)$ |
| Equality | $F[q] = -\log p(x)$ when $q(z) = p(z \mid x)$ |

Key: minimizing $F[q]$ over $q$ is equivalent to minimizing $\mathrm{KL}(q(z) \,\|\, p(z \mid x))$, and $F[q]$ is an upper bound on the negative log evidence.
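
A toy check of the bound on a two-state discrete latent (the joint values are made up):

```python
from math import log

# Toy discrete model: latent z in {0, 1}, single observation x.
# The joint p(x, z) is specified directly as two numbers.
joint = [0.12, 0.28]                  # p(x, z=0), p(x, z=1)
p_x = sum(joint)                      # evidence p(x)
true_post = [j / p_x for j in joint]  # p(z | x)

def vfe(q):
    """F[q] = sum_z q(z) * log(q(z) / p(x, z))."""
    return sum(qz * log(qz / jz) for qz, jz in zip(q, joint) if qz > 0)

print(vfe([0.5, 0.5]), vfe(true_post), -log(p_x))
# F at an arbitrary q exceeds -log p(x); F at the true posterior equals it
```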


Type 10: Free Energy Principle (FEP) — Concepts

| Principle | Statement |
| --- | --- |
| Generative model | Agents MUST have an internal model of how sensory data is generated |
| Perception | Minimizes free energy (update beliefs to match observations) |
| Action | Minimizes expected free energy of future states |
| Goals | Encoded as target priors in the generative model |
| Decision making | Minimization of a functional of beliefs about future states |

Rules of thumb: always MINIMIZE (never maximize) free energy. Decisions are functionals of beliefs, not cost functions. Goals encode the desired future (target priors), not a prediction of the actual future.


Type 11: Factor Analysis & Marginal Gaussian

| Formula | Expression |
| --- | --- |
| Model | $x = Wz + \epsilon$, $z \sim \mathcal{N}(0, I)$, $\epsilon \sim \mathcal{N}(0, \Psi)$ |
| Conditional | $p(x \mid z) = \mathcal{N}(x \mid Wz, \Psi)$ |
| Joint | $p(x, z) = \mathcal{N}(x \mid Wz, \Psi) \cdot \mathcal{N}(z \mid 0, I)$ |
| Marginal | $p(x) = \mathcal{N}(x \mid 0,\; WW^T + \Psi)$ |

Note: $WW^T$ (not $W^TW$): the marginal covariance must match the dimensionality of $x$, so with $W$ of size $N \times M$ (observed × latent), only $WW^T$ gives the required $N \times N$ matrix.
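
A sampling sanity check of the marginal covariance (a sketch assuming NumPy; $W$ and $\Psi$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Factor analysis: x = W z + eps, z ~ N(0, I), eps ~ N(0, Psi)
D, M, S = 3, 2, 200_000          # observed dim, latent dim, sample count
W = rng.normal(size=(D, M))
Psi = np.diag([0.5, 1.0, 1.5])   # diagonal noise covariance

z = rng.normal(size=(S, M))
eps = rng.multivariate_normal(np.zeros(D), Psi, size=S)
x = z @ W.T + eps

# Empirical covariance of x should approach W W^T + Psi (a D x D matrix)
print(np.round(np.cov(x.T), 2))
print(np.round(W @ W.T + Psi, 2))
```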


Type 12: Recursive Filtering / Kalman

| Formula | Expression |
| --- | --- |
| Observation model | $x_k = \theta + \epsilon_k$, $\epsilon_k \sim \mathcal{N}(0, \sigma_\epsilon^2)$ |
| Likelihood | $p(x_k \mid \theta) = \mathcal{N}(x_k \mid \theta, \sigma_\epsilon^2)$ |
| Recursive update | $p(\theta \mid D_k) \propto p(x_k \mid \theta) \cdot p(\theta \mid D_{k-1})$ |
| Kalman gain | $K_k = \dfrac{\sigma_{k-1}^2}{\sigma_{k-1}^2 + \sigma_\epsilon^2}$ |
| Mean update | $\mu_k = \mu_{k-1} + K_k \, (x_k - \mu_{k-1})$ |
| Variance update | $\sigma_k^2 = (1 - K_k) \, \sigma_{k-1}^2$ |

State-space update: $p(z_t \mid x_{1:t}) \propto p(x_t \mid z_t) \sum_{z_{t-1}} p(z_t \mid z_{t-1})\, p(z_{t-1} \mid x_{1:t-1})$

As $k \to \infty$: $\sigma_k^2 \to 0$, $K_k \to 0$, and $\mu_k \to$ the true value (assuming $\theta$ is stationary).
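
The update loop in code (a sketch; prior, noise level, and observations are invented):

```python
# Recursive estimation of a constant theta from noisy observations
mu, var = 0.0, 10.0        # prior mean and variance over theta
var_eps = 1.0              # observation noise variance
observations = [2.1, 1.9, 2.3, 2.0, 1.8]

for x in observations:
    K = var / (var + var_eps)      # Kalman gain
    mu = mu + K * (x - mu)         # mean update: move toward x by gain K
    var = (1 - K) * var            # variance update: uncertainty shrinks

print(mu, var)  # mean near the data average, variance heading toward 0
```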


Type 13: Log-Likelihood & MLE

| Formula | Expression |
| --- | --- |
| Log-likelihood | $\log p(D \mid \theta) = \sum_n \sum_k y_{nk} \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) + \sum_n \sum_k y_{nk} \log \pi_k$ |
| MLE mean | $\hat{\mu}_k = \dfrac{\sum_n y_{nk}\, x_n}{N_k}$ where $N_k = \sum_n y_{nk}$ |
| MLE covariance | $\hat{\Sigma}_k = \dfrac{1}{N_k} \sum_n y_{nk}\, (x_n - \hat{\mu}_k)(x_n - \hat{\mu}_k)^T$ |
| Log-Gaussian | $\log \mathcal{N}(x \mid \mu, \Sigma) = -\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu) + \text{const}$ |

Covariance trick: Must use outer product $(x-\mu)(x-\mu)^T$ (gives matrix), NOT inner product $(x-\mu)^T(x-\mu)$ (gives scalar).

Decision boundary: Different covariances → quadratic (parabola). Shared covariance → linear (straight line).
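
The MLE formulas with explicit outer products (a NumPy sketch on a tiny invented dataset):

```python
import numpy as np

# Labeled 2-D data: y[n, k] = 1 if point n belongs to class k (one-hot)
X = np.array([[0.0, 0.1], [0.2, -0.1], [3.0, 3.1], [2.8, 2.9]])
y = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)

N_k = y.sum(axis=0)            # per-class counts
mu = (y.T @ X) / N_k[:, None]  # MLE means, one row per class

# MLE covariance per class: average of OUTER products (x - mu)(x - mu)^T
for k in range(y.shape[1]):
    diffs = X - mu[k]                    # (N, D)
    weighted = y[:, k][:, None] * diffs  # zero out points not in class k
    Sigma_k = (weighted.T @ diffs) / N_k[k]
    print(k, Sigma_k)                    # each Sigma_k is a D x D matrix
```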


Type 14: GMM, VFE, FEP & Concept Questions

VFE Properties

  • $F[q] \geq -\log p(x)$ — upper bound on negative log evidence
  • Equality at $q(z) = p(z \mid x)$ (true posterior, NOT the prior)
  • Minimizing VFE = minimizing KL divergence to true posterior

Bayesian vs MLE

  • Likelihood is a function of parameters: $L(\theta) = p(D \mid \theta)$
  • MLE maximizes $p(D \mid \theta)$; MAP maximizes $p(D \mid \theta)p(\theta)$ — equal only with uniform prior
  • As data grows: likelihood narrows, prior stays fixed → MLE ≈ MAP
  • Bayesian evidence = fit minus complexity (built-in overfitting protection)

Gaussian Properties

  • Linear combination of Gaussian random variables → Gaussian
  • Products/ratios of Gaussian random variables → NOT Gaussian

Discriminative Predictive

Generative Model for Signal Recovery


Quick-Reference Concept Facts

  • Bayesian methods are NOT faster than MLE — often more computationally expensive
  • No train/test split needed in Bayesian approach — all data used for inference
  • Beta is the conjugate prior for Bernoulli (not Gaussian)
  • MLE covariance uses $1/N_k$ (not $1/N$) and requires the $y_{nk}$ class selector

Gamma Factorial Quick Values

| $n$ | $\Gamma(n) = (n-1)!$ |
| --- | --- |
| 1 | 1 |
| 2 | 1 |
| 3 | 2 |
| 4 | 6 |
| 5 | 24 |
| 6 | 120 |
| 7 | 720 |
| 8 | 5040 |
| 9 | 40320 |
| 10 | 362880 |